# Day 6: Supporting PMUs on RISC-V platforms (Part 2)

2021 iThome Ironman Contest

DAY 6

Software Development

Reading the Linux Kernel Documentation series, part 6

13th Ironman Contest

ycliang

Team 晶心壯士 II: SOLAR System

2021-09-15 23:33:52


Today continues with Supporting PMUs on RISC-V platforms. First, a quick recap of yesterday's introduction:

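- perf is a performance analysis tool. It can measure, for example, how many cycles a program uses and how many cache misses occur while it runs.
- The PMU (Performance Monitoring Unit) is a hardware component that provides basic counting facilities; perf is built on top of it.
- Version 1.10 of the RISC-V privileged ISA specification offers little HPM (Hardware Performance Monitor) support. Compared with architectures that fully support perf, RISC-V lacks:
  1. Enabling and disabling counters: the counters always increment, and there is no way to pause them.
  2. Writing counters: the OS runs in S-mode and cannot write M-mode registers.
  3. A counter overflow interrupt and an interrupt indicator: if a counter overflows, the OS has no way of knowing.
  4. Privilege level filtering: perf can restrict counting to a given level (user space, kernel space), but RISC-V currently does not support this.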

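As a result, perf stat is serviceable, while perf record cannot be used.
As before, we first study the document itself and then look at how perf stat and perf record differ.

Note: see the supplementary reference RISC-V Perf Tool Status.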

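The document

Original document: Supporting PMUs on RISC-V platforms
This document provides guidelines and lists what needs to be done to port perf to RISC-V platforms. The RISC-V HPM specification has recently been revised and standardized as a new extension, and the corresponding code is under review, so this document may well be deprecated in the near future. For that reason I will not translate it in detail here and will instead read it side by side with the code.

Porting guidelines (continued)

PMU initialization:

riscv_pmu is the PMU instance on RISC-V platforms. By default it points to riscv_base_pmu, the baseline PMU implementation; implementers can extend this data structure to suit their own needs.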

```
/* ${linux}/arch/riscv/kernel/perf_event.c */
static const struct riscv_pmu *riscv_pmu __read_mostly;
static const struct riscv_pmu riscv_base_pmu = {  // internal PMU data structure
    .pmu = &min_pmu,
    .max_events = ARRAY_SIZE(riscv_hw_event_map),
    .map_hw_event = riscv_map_hw_event,
    .hw_events = riscv_hw_event_map,
    .map_cache_event = riscv_map_cache_event,
    .cache_events = &riscv_cache_event_map,
    .counter_width = 63,
    .num_counters = RISCV_BASE_COUNTERS + 0,
    .handle_irq = &riscv_base_pmu_handle_irq,
    /* This means this PMU has no IRQ. */
    .irq = -1,
};
static int __init init_hw_perf_events(void)
{
    struct device_node *node = of_find_node_by_type(NULL, "pmu");
    const struct of_device_id *of_id;
    riscv_pmu = &riscv_base_pmu;         // default to riscv_base_pmu

    if (node) {
        of_id = of_match_node(riscv_pmu_of_ids, node);  // match the pmu node found in the DTS

        if (of_id)
            riscv_pmu = of_id->data;
        of_node_put(node);
    }

    perf_pmu_register(riscv_pmu->pmu, "cpu", PERF_TYPE_RAW);
    return 0;
}
arch_initcall(init_hw_perf_events); // the kernel runs this function during the arch stage of initcalls
```
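
The selection pattern used during initialization can be modeled in user space. The following is a minimal sketch, not kernel code: `struct pmu_desc`, `struct pmu_match`, `select_pmu`, and the compatible string `"vendor,custom-pmu"` are hypothetical names introduced only for illustration. It shows the same idea: start from the baseline description and override it only when the platform reports a node that matches a known implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct riscv_pmu. */
struct pmu_desc {
    const char *name;
    int num_counters;
    int irq;                /* -1 means "no overflow interrupt" */
};

/* Hypothetical match table entry, loosely modeled on struct of_device_id. */
struct pmu_match {
    const char *compatible;
    const struct pmu_desc *data;
};

static const struct pmu_desc base_pmu   = { "riscv-base", 2, -1 };
static const struct pmu_desc custom_pmu = { "custom", 6, 11 };

static const struct pmu_match pmu_matches[] = {
    { "vendor,custom-pmu", &custom_pmu },   /* made-up compatible string */
    { NULL, NULL },                         /* sentinel */
};

/*
 * Return the PMU description for the given compatible string.
 * Falls back to the baseline when there is no node (NULL) or no match.
 */
static const struct pmu_desc *select_pmu(const char *compatible)
{
    const struct pmu_desc *pmu = &base_pmu;     /* default */
    const struct pmu_match *m;

    if (!compatible)
        return pmu;

    for (m = pmu_matches; m->compatible; m++) {
        if (strcmp(m->compatible, compatible) == 0) {
            pmu = m->data;
            break;
        }
    }
    return pmu;
}
```

Calling `select_pmu(NULL)` models a device tree without a pmu node and yields the baseline, whose `irq` of -1 mirrors the "no IRQ" setting in riscv_base_pmu.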

PMU event initialization
+ When perf is used, it issues the perf_event_open system call, which in turn invokes the PMU's event_init member.
+ The baseline RISC-V PMU currently supports only two events: cycle count and instruction count.

```
/* ${linux}/arch/riscv/kernel/perf_event.c */
static const int riscv_hw_event_map[] = {    // the baseline only supports cycle and instruction counts
    [PERF_COUNT_HW_CPU_CYCLES]          = RISCV_PMU_CYCLE,
    [PERF_COUNT_HW_INSTRUCTIONS]        = RISCV_PMU_INSTRET,
    [PERF_COUNT_HW_CACHE_REFERENCES]    = RISCV_OP_UNSUPP,
    [PERF_COUNT_HW_CACHE_MISSES]        = RISCV_OP_UNSUPP,
    [PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = RISCV_OP_UNSUPP,
    [PERF_COUNT_HW_BRANCH_MISSES]       = RISCV_OP_UNSUPP,
    [PERF_COUNT_HW_BUS_CYCLES]          = RISCV_OP_UNSUPP,
};
static const int riscv_cache_event_map[PERF_COUNT_HW_CACHE_MAX]
[PERF_COUNT_HW_CACHE_OP_MAX]
[PERF_COUNT_HW_CACHE_RESULT_MAX] = {
    [C(L1D)] = {                         // L1 data cache
        [C(OP_READ)] = {                 // read operation
            [C(RESULT_ACCESS)] = RISCV_OP_UNSUPP,
            [C(RESULT_MISS)] = RISCV_OP_UNSUPP,
        },
        [C(OP_WRITE)] = {
            [C(RESULT_ACCESS)] = RISCV_OP_UNSUPP,
            [C(RESULT_MISS)] = RISCV_OP_UNSUPP,
        },
        [C(OP_PREFETCH)] = {
            [C(RESULT_ACCESS)] = RISCV_OP_UNSUPP,
            [C(RESULT_MISS)] = RISCV_OP_UNSUPP,
        },
    },
    ...
};
static int riscv_event_init(struct perf_event *event)
{
    ...
    switch (event->attr.type) {
    case PERF_TYPE_HARDWARE:
        code = riscv_pmu->map_hw_event(attr->config);   // map a generic hardware event
        break;
    case PERF_TYPE_HW_CACHE:
        code = riscv_pmu->map_cache_event(attr->config);
        break;
    case PERF_TYPE_RAW:
        return -EOPNOTSUPP;
    default:
        return -ENOENT;
    }

    event->destroy = riscv_event_destroy;
    if (code < 0) {
        event->destroy(event);
        return code;
    }
    ...
}
static int riscv_map_hw_event(u64 config)           // the actual init (mapping) step
{
    ...
    return riscv_pmu->hw_events[config];
}
```
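
To make the table-driven mapping concrete, here is a small user-space sketch. It is not the kernel's implementation: the enum `generic_event`, the `HW_*` codes, and `map_hw_event` are hypothetical stand-ins for the `PERF_COUNT_HW_*` IDs, the `RISCV_PMU_*` codes, and the mapping callback. The bounds check and its `-EINVAL` return value are also assumptions, since the excerpt above elides that part of the function.

```c
#include <errno.h>

/* Hypothetical generic event IDs, standing in for PERF_COUNT_HW_*. */
enum generic_event {
    EV_CPU_CYCLES,
    EV_INSTRUCTIONS,
    EV_CACHE_MISSES,
    EV_MAX,
};

/* Hypothetical hardware codes, standing in for the RISCV_PMU_* values. */
#define HW_CYCLE    0
#define HW_INSTRET  1
#define HW_UNSUPP   (-EOPNOTSUPP)   /* assumed: unsupported is a negative errno */

/* Designated initializers keep the table readable and order-independent. */
static const int hw_event_map[EV_MAX] = {
    [EV_CPU_CYCLES]   = HW_CYCLE,
    [EV_INSTRUCTIONS] = HW_INSTRET,
    [EV_CACHE_MISSES] = HW_UNSUPP,
};

/*
 * Translate a generic event ID into a hardware code.
 * Returns a negative errno value for out-of-range or unsupported events.
 */
static int map_hw_event(unsigned long long config)
{
    if (config >= (unsigned long long)EV_MAX)
        return -EINVAL;             /* assumed bounds check */
    return hw_event_map[config];
}
```

Because unsupported events map to a negative value, the caller only needs a single `code < 0` test, which is the same shape as the check in riscv_event_init.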

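Interrupt

The interrupt exists so that the IRQ handler can take care of counter overflow. During event_init, reserve_pmc_hardware registers this service routine so that it is available globally.
RISC-V does not provide this capability at present, however, so this part of the code is merely a stub.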

```
static int reserve_pmc_hardware(void)
{
    int err = 0;

    mutex_lock(&pmc_reserve_mutex);
    if (riscv_pmu->irq >= 0 && riscv_pmu->handle_irq) {
        err = request_irq(riscv_pmu->irq, riscv_pmu->handle_irq,
                  IRQF_PERCPU, "riscv-base-perf", NULL);
    }
    mutex_unlock(&pmc_reserve_mutex);

    return err;
}
```
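
Since the baseline handler is only a stub, a rough user-space model may help show what overflow handling would normally involve. Everything here is hypothetical: `struct sim_counter`, `sim_tick`, `sim_handle_irq`, and `sim_run` simulate an 8-bit counter and are not kernel APIs. Note that the model relies on a writable counter and an overflow indicator, two of the features the baseline RISC-V PMU is described as lacking.

```c
#include <stdint.h>

#define SIM_WIDTH 8u                            /* hypothetical 8-bit counter */
#define SIM_MASK  ((1u << SIM_WIDTH) - 1u)

struct sim_counter {
    uint32_t value;         /* current counter value, always <= SIM_MASK */
    uint32_t reload;        /* value written back after each overflow */
    int overflow_pending;   /* models the overflow interrupt indicator */
    unsigned samples;       /* number of overflows handled so far */
};

/* Hardware side: count one event and flag a wrap-around to zero. */
static void sim_tick(struct sim_counter *c)
{
    c->value = (c->value + 1u) & SIM_MASK;
    if (c->value == 0)
        c->overflow_pending = 1;
}

/* Software side: roughly what an overflow handler does. Returns 1 if handled. */
static int sim_handle_irq(struct sim_counter *c)
{
    if (!c->overflow_pending)
        return 0;               /* nothing to do */
    c->overflow_pending = 0;    /* acknowledge the overflow */
    c->samples++;               /* a real handler would record a sample here */
    c->value = c->reload;       /* re-arm the counter for the next period */
    return 1;
}

/* Drive the model for n events, servicing overflows as they occur. */
static unsigned sim_run(struct sim_counter *c, unsigned n)
{
    unsigned i;

    for (i = 0; i < n; i++) {
        sim_tick(c);
        (void)sim_handle_irq(c);
    }
    return c->samples;
}
```

With `value` and `reload` both set to 246, the counter wraps every 10 events, so the handler fires once per 10 events counted.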

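Reading and writing counters
- Reading and writing look symmetric here, one performed on reads and the other on writes. In practice, however, perf never asks for a counter write directly; it only asks for reads. Reading is intuitive: read the counter once when the event starts and once when it ends, and the difference is the number of times the event occurred during the run.
- A counter is written in two places: in pmu->start, where the counter is set to a suitable initial value before waiting for an overflow, and in the overflow handler, where the counter is reset to that same initial value.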
```
static inline u64 read_counter(int idx)
{
    u64 val = 0;

    switch (idx) {
    case RISCV_PMU_CYCLE:
        val = csr_read(CSR_CYCLE);
        break;
    case RISCV_PMU_INSTRET:
        val = csr_read(CSR_INSTRET);
        break;
    default:
        WARN_ON_ONCE(idx < 0 || idx > RISCV_MAX_COUNTERS);
        return -EINVAL;
    }

    return val;
}

static inline void write_counter(int idx, u64 value)  // writing counters from S-mode is currently not supported
{
    /* currently not supported */
    WARN_ON_ONCE(1);
}
```
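
The arithmetic behind both operations can be sketched in plain C. This is an illustration under stated assumptions rather than the kernel's code: `counter_mask`, `counter_delta`, and `counter_preload` are hypothetical helpers, and the model assumes a counter that wraps to zero once it exceeds its width.

```c
#include <stdint.h>

/* Mask covering the low `width` bits of a counter (1 <= width <= 64). */
static uint64_t counter_mask(unsigned width)
{
    return width >= 64 ? UINT64_MAX : ((UINT64_C(1) << width) - 1);
}

/*
 * Number of events between two reads of a free-running counter.
 * Unsigned subtraction followed by masking stays correct even if the
 * counter wrapped once between the two reads.
 */
static uint64_t counter_delta(uint64_t prev, uint64_t now, unsigned width)
{
    return (now - prev) & counter_mask(width);
}

/*
 * Initial value to write so that the counter wraps after `period` events.
 * Assumes 0 < period <= counter_mask(width).
 */
static uint64_t counter_preload(uint64_t period, unsigned width)
{
    return (0 - period) & counter_mask(width);
}
```

`counter_delta` corresponds to the read-at-start, read-at-end pattern, which needs no write access at all. `counter_preload` corresponds to the value written in start and again in the overflow handler, which is the part that cannot be done without writable counters.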
add()/del()/start()/stop()
- The basic idea: add/del attach an event to the PMU or detach it, while start/stop start or stop the counter.
- Reproducing the code in full would be too long; see the linked source.
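
In place of the full kernel code, the relationship between the four callbacks can be shown with a small user-space model. All names here (`struct sim_pmu`, `struct sim_event`, `sim_add`, `sim_del`, `sim_start`, `sim_stop`) are hypothetical, and the model simplifies freely: add always starts the event, and the `-ENOSPC` return value is an assumption. Because the simulated counters never pause, start and stop only take snapshots, which mirrors the always-incrementing counters described earlier.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_NUM_COUNTERS 2      /* hypothetical: two counter slots */

struct sim_event {
    int idx;            /* assigned counter slot, or -1 when not on the PMU */
    int running;        /* 1 between start() and stop() */
    uint64_t start;     /* counter value captured by start() */
    uint64_t count;     /* accumulated event count */
};

struct sim_pmu {
    struct sim_event *slots[SIM_NUM_COUNTERS];
    uint64_t hw[SIM_NUM_COUNTERS];      /* simulated free-running counters */
};

/* start(): begin counting by remembering the current counter value. */
static void sim_start(struct sim_pmu *p, struct sim_event *e)
{
    e->start = p->hw[e->idx];
    e->running = 1;
}

/* stop(): fold the events seen since start() into the running total. */
static void sim_stop(struct sim_pmu *p, struct sim_event *e)
{
    if (!e->running)
        return;
    e->count += p->hw[e->idx] - e->start;
    e->running = 0;
}

/* add(): claim a free counter slot for the event, then start it. */
static int sim_add(struct sim_pmu *p, struct sim_event *e)
{
    int i;

    for (i = 0; i < SIM_NUM_COUNTERS; i++) {
        if (!p->slots[i]) {
            p->slots[i] = e;
            e->idx = i;
            sim_start(p, e);
            return 0;
        }
    }
    return -ENOSPC;     /* every counter slot is already in use */
}

/* del(): stop the event and release its counter slot. */
static void sim_del(struct sim_pmu *p, struct sim_event *e)
{
    sim_stop(p, e);
    p->slots[e->idx] = NULL;
    e->idx = -1;
}
```

The split matters because add/del manage a scarce resource (counter slots), while start/stop only bracket the interval being measured.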